Introduction

The purpose of this pipeline is to provide a set of additional analyses to complement the analysis done by Metabolon. These additional analysis are broken down into four main sections.

  1. Data Exploration

  2. Sub-pathway Analysis

  3. Pairwise comparisons

  4. Plots

img = image_read("../Workflow.png")

image_trim(img)

The R “chunks” separate each of the analysis steps within the main sections of the pipeline. In some chunks, there may need to be lines of code that need to change from experiment to experiment. This pipeline provides a list of steps for each chunk and highlights places where you need to make changes.

Normalization and Standardization

In its raw form, the peak data contain counts for each metabolite in each sample. Standardization and normalization are important steps to improve the signal-to-noise ratio. In the next section, we are implementing the same standardization and normalization methods used by Metabolon. First we will load the raw peak data into R. To start we:

  1. Provide a path with the file name to the Data tables .xlsx file provided by Metabolon. You may need to change this file path to reflect the name of the data tables .xlsx file for your project.

  2. Read in the “Peak Area Data” tab from the .xlsx file provided above.

# Provide a path to Metabolon .xlsx file. 
metabolon_path <- "../Data/Metabolon/UNAZ-0501-22VW_ DATA TABLES.xlsx"


# 2. Load raw peak data
peak_data <- read.xlsx(metabolon_path, sheet = "Peak Area Data")

Median Standardization

The first step in the Metabolon normalization process is median standardization. This step will ensure that each metabolite in the dataset has the same median value. To show this we will show the box plots for 5 metabolites before and after median normalization. Below is the box plots before median standardization where we:

  1. Select the first five metabolites for the box plot.

  2. Create the boxplot data

  3. Plot the box plots for each of the five metabolites.

# 1. Select the first five metabolites for the box plot. 
metabolites <- names(peak_data)[2:6]


# 2. Create the boxplot data
plot_data <- peak_data %>%
  reshape2::melt(id.vars="PARENT_SAMPLE_NAME") %>%
  filter(variable %in% metabolites)


# 3. Plot the box plots for each of the five metabolites. 
ggplot(plot_data,aes(x=variable,y=value)) +
  geom_boxplot() +
  theme_bw()
## Warning: Removed 23 rows containing non-finite values (`stat_boxplot()`).

Now we perform median standardization by:

  1. Initializing a new peak_data_norm matrix

  2. Create a data set containing the median value for each metabolite

  3. Divide each value by the median metabolite value.

# 1. Initialize a new peak_data_med matrix
peak_data_norm <-  peak_data 


# 2. Create a matrix containing the median value for each metabolite 
peak_data_med <- peak_data_norm %>%
  select(-PARENT_SAMPLE_NAME) %>%
  summarise_all(median, na.rm = T)


# 3. Divide each value for each metabolite by the median value of that metabolite
for(i in colnames(peak_data_med)){
  peak_data_norm[,i] <- peak_data_norm[,i]/peak_data_med[,i]
}

Now we can look at those same metabolites from before. We will notice that the median value for all metabolites are now lined up at 1.

# 1. Select the first five metabolites for the box plot. 
metabolites <- names(peak_data)[2:6]


# 2. Create the boxplot data
plot_data <- peak_data_norm %>%
  reshape2::melt(id.vars="PARENT_SAMPLE_NAME") %>%
  filter(variable %in% metabolites)


# 3. Plot the box plots for each of the five metabolites. 
ggplot(plot_data,aes(x=variable,y=value)) +
  geom_boxplot() +
  theme_bw()
## Warning: Removed 23 rows containing non-finite values (`stat_boxplot()`).

Minimum value imputation

Once we standardize each metabolite by the median value, we impute the missing values for each metabolite with the minimum value for that metabolite. We do this in the following steps.

  1. Initialize the new peak_data_imputed matrix

  2. Find the minimum value for each metabolite.

  3. Create a for loop such that if a metabolite has a missing observation, then the minimum value for that metabolite is imputed.

# 1. Initialize the new peak_data_imputed matrix
peak_data_imputed <- peak_data_norm


# 2. Find the minimum value for each metabolite and compute 1/5 of that value
peak_data_mins <- peak_data_imputed %>%
  select(-PARENT_SAMPLE_NAME) %>%
  summarise_all(min, na.rm = T) 


# 3. Impute the value
for(i in colnames(peak_data_mins)){
  if(length(peak_data_imputed[,i][is.na(peak_data_imputed[,i])]) > 0){
    peak_data_imputed[which(is.na(peak_data_imputed[,i])),i] <- as.numeric(peak_data_mins[i])
  }
}

Log Transformation

The final step is to take a natural log transformation of each of the values.

  1. Log transform all of the values
# 1. Log transform all of the values
peak_data_log <- peak_data_imputed %>%
  mutate_if(is.numeric, log)

Save data

Now that we have imputed and normalized our data, we need to save the data for subsequent sections. We will save peak_data_log in the “Data/Processed” Folder under the file name “Log_transformed_data.csv”.

# 1. Save log transformed data.
write.csv(peak_data_log, file = "../Data/Processed/Log_transformed_data.csv",
          row.names = F)

Analysis Data Creation

For the downstream analysis we will need merge the sample metadata and the log transformed peak data to create one analysis dataset. In the normalization and standardization section we took the raw peak data and performed median value standardization, minimum value imputation and natural log transformation. Metabolon does these preprocessing steps for their customers and provides the log transformed data for their customers in a .xlsx. We recommend utilizing the preprocessed log transformed provided by Metabolon. By utilizing the Log Transformed data derived by Metabolon, we can ensure the down stream supplemental analysis are utilizing the same data as Metabolon.

We need to merge the sample metadata and the Log normalized data together. Additionally, we want to replace the column names with the names of the metabolites. To do this we are going to read in the sample metadata and the log normalized data. We will do this by:

  1. Provide a path to the metabolon .xlsx file.

  2. Read in the sample metadata under the “Sample Meta Data” tab.

  3. Read in the Log transformed data from the “Log Transformed Data” tab.

# 1. Metabolon excel file 
met_excel <- "../Data/Metabolon/UNAZ-0501-22VW_ DATA TABLES.xlsx"


# 2. Read in sample metadata
meta_data <- read.xlsx(met_excel,sheet = "Sample Meta Data")


# 3. Read Log transformed data
peak_data_log <- read.xlsx(met_excel,sheet = "Log Transformed Data")

If there are additional variables that need to be included into the analysis, you can place these variables in a .xlsx file. You must include a variable called “PARENT_SAMPLE_NAME” that links the metadata to each of the samples.

In the following chunk we merge the additional metadata variables to the current metadata. We do this in the following steps.

  1. Provide a path to the additional metadata variables. If you do not have any additional metadata variables, you can set the “additional_meta=NULL”.

  2. If there are additional meta data variables, then merge the additional metadata variables with the metadata from Metabolon.

# 1. Provide path to additional metadata
additional_meta <- "../Data/Metabolon/AdditionalVars.xlsx"


# 2. Merge additional vars to the meta data
if(!is.null(additional_meta)){
  meta_data <- read.xlsx(additional_meta) %>%
    left_join(meta_data,"PARENT_SAMPLE_NAME")
}

To create the analysis data, we want the metadata and the log transformed peak data merged so each row of the analysis data set is a different sample and the columns contain the sample metadata and the log transformed peak data. To do this we:

  1. Select the metadata variables need for down stream analysis. The metadata provided by Metabolon includes variables which may not be necessary for the analysis. In this section will select the variables in the metadata that are necessary for downstream analysis. To do this we will need to identify the variables we want to keep. This will include the PARENT_SAMPLE_NAME and the experiment group variables. Additionally, if your experiment includes males and females, we recommend including a Sex/Gender variable and stratify the analysis.

  2. Merge the metadata and the log transformed data together while keeping only the metadata variables needed for the analysis.

# 1. Select metadata variables
metadata_variables <- c("PARENT_SAMPLE_NAME", 
                        "GROUP_NAME",
                        "TIME1",
                        "Gender")


# 2. Create analysis data
analysis_data <- meta_data %>%
  select(all_of(metadata_variables)) %>%
  left_join(peak_data_log,"PARENT_SAMPLE_NAME")

Once we have the analysis data, we want to save it so we can use it for the downstream analysis. We will save this in the “Data/Processed” folder under the file name “analysis_data.csv”

# Save analysis data
write.csv(analysis_data,"../Data/Processed/analysis_data.csv", row.names = F)

Data Exploration

In data exploration, we use several methods to help us better understand the underlying patterns in the data without using a formal hypothesis test. In this pipeline, we are going to focus on two methods for data exploration:

A.) Principal component analysis

B.) Heat maps

Before diving into the data exploration, we need read the analysis data set. Note, we use check.names=F since column names of the analysis data are numeric.

# 1. Read analysis dataset
analysis_data <- read.csv("../Data/Processed/analysis_data.csv", check.names = F)

Table 1

The goal of this section is to design a table that resembles the study design of your experiment. This step allows us to see descriptive statistics of the samples in our data. For this, we are using the analysis data to define a table with a similar structure as the table below.

Strat 1 Strat 2
Var1 XX XX
Var 2 XX XX
Var 3 XX XX

Alternatively, we can define a table non-stratified table with only the number of observations within each experimental group. The following code will create the desired table 1 for both scenarios.

In this chunk, we will label the analysis data variable names. Labeling the variable names will allow the tables to display your chosen variable names while maintaining the original variable name within the data. We do this in two steps.

  1. Choose the variable name.
  • Change the name assigned to var1 to match the name of the variable in the analysis data set which you want to include in the table.

  • Repeat this for as many variables as needed.

  1. Choose a single variable to stratify the table by.

  2. Assign the variable a label name

  • you can change the label of the variable in the table by relabeling the variable to the desired name on the right side of the equation.
  1. Assign a stratified variable a label name for the table.
# 1. Choose the meta data variable name.
var1 = "GROUP_NAME"

var2 = "TIME1"


# 2. Choose a single variable to stratify the table by.
stratified_var = "Gender"


# 3. Assign the variable a label name
label(analysis_data[,var1]) <- "Treatment Group" 

label(analysis_data[,var2]) <- "Time"


# 4. Assign a stratified variable a label name for the table. 
label(analysis_data[,stratified_var]) = "Gender"

Now that you have properly assigned the variables, we will create table 1. To create the table, we will be using the table1 function. You must change the arguments depending on what you want the table to display. For example: * Table 1 without interaction:

\[ table1(\sim \underbrace{var_1 + var_2 ...}_{\text{Variable names for rows}}, \ \text{data = analysis_data} ) \]

  • Table 1 with interaction

\[ table1(\sim \underbrace{var_1 + var_2 ...}_{\text{Variable names for rows}} \underbrace{| \ stratified\_var}_{\text{Statified Variable}}, \ \text{data = analysis_data} ) \] Above is how you change the table1 function to display the table with and without a stratified variable. In the next chunk, we utilize the following steps.

  1. Create table 1. In this step you will need to use the same variable names as in the above chunk within the table1 function.

  2. Display the table

  3. Save the table in the folder “/Outputs/Tables/” with the file name “table1”.

# 1. Creates table1
tbl1 <- table1(~ TIME1 + GROUP_NAME| Gender
               , data = analysis_data)


# 2. Displays table 1
tbl1
Female
(N=44)
Male
(N=42)
Overall
(N=86)
Time
End 14 (31.8%) 15 (35.7%) 29 (33.7%)
Onset 15 (34.1%) 14 (33.3%) 29 (33.7%)
PreSymp 15 (34.1%) 13 (31.0%) 28 (32.6%)
Treatment Group
1902 15 (34.1%) 14 (33.3%) 29 (33.7%)
Control 14 (31.8%) 14 (33.3%) 28 (32.6%)
WT 15 (34.1%) 14 (33.3%) 29 (33.7%)
# 3. saves table 1 
t1flex(tbl1) %>%
  save_as_docx(path = paste0("../Outputs/Tables/","table1.docx"))

Principal Component Analysis

In general, Principal component analysis (PCA) reduces the number of variables in a dataset while preserving as much information from the data as possible. At a high level, PCA is constructed such that the first principal component accounts for the largest possible amount of variance within the data. The second principal component accounts for the largest amount of remaining variance, and so on. Additionally, each of the principal components produced by PCA is uncorrelated with each of the other principal components. At a high level, PCA allows us to visualize sources of variation in the data.

Here, we will graph the first two principal components of our dataset. In the PCA plot, we can label each point based on variables from the metadata.

The following chunk runs PCA on the scale_peak_data matrix in the following steps:

1.Create a vector of non metabolite variables that are in the the analysis data set. These variables include information about the experimental conditions.

  1. Create a data set for the principle component analysis which only includes information on the metabolites.

  2. Run PCA of the pca_dat matrix containing only the metabolites.

  3. Define the plot labels. This is a character string matching the name of one of the variables in the analysis_data which will be used to label points on the PCA graph.

  4. Create and display the PCA plot.

  5. Save the PCA plot in the “Outputs/Plots” folder with the file name PCA.pdf

# 1. Create the non-metabolite vector. 
non_metabolite <- c("PARENT_SAMPLE_NAME",
                             "GROUP_NAME",
                             "TIME1",
                             "Gender")


# 2. Create PCA data containing only metabolite data
pca_dat <- analysis_data[,!names(analysis_data) %in% non_metabolite]


# 3. Run PCA of the pca_dat matrix containing only the metabolites.   
res.pca <- PCA(pca_dat, 
               graph = F)


# 4. Define labels
meta_var = "Gender"


# 5. Create figure 
fviz_pca_ind(res.pca, 
             label = "none",
             habillage = as.factor(analysis_data[,meta_var])) 

# 6. Save figure
ggsave(paste0("../Outputs/Plots/","PCA.pdf"), width = 10, height = 10)

Suppose you notice a variable with clearly separated groups, and is not a variable of interest. In that case, you may want to consider stratifying your analysis downstream by the values of that variable. For example, if we are looking at the effects of a specific drug, and we notice in the above plot that the samples are grouped by sex, then we may want to consider stratifying the analysis by male and female.

Heatmaps

Another exploratory analysis tool we can use is heatmaps, whih is a gridded plot based on the x-axis- and y-axis labels. The color of the spot on the grid is based on the value’s intensity. For our heatmap, the x-axis will be the samples, and the y-axis will be the metabolites. The values determining the colors will be the log peak value for each chemical in each observation. We can order our observations to see intensity patterns based on our experimental conditions. This is another way of visualizing sources of variation within our data. To do this, we need to merge the chemical annotation, meta, and scaled peak data. Then we need to arrange the data based on our experimental conditions. The create_heatmap_Data function in the R folder will help us prepare the data for the heatmap. To do this the create_heatmap_Data function takes three arguments.

  • Analysis data (analysis_data)

  • Heatmap Variables (heatmap_variables) - defines the variables used to order the samples.

  • … -

The main utility of create_heatmap data is in (…), which allows you to arrange the data based on experimental conditions. If you are familiar with dplyr, the arrange function orders samples based on the arguments passed into (…). If you are unfamiliar with dplyr, an overview of the arrange function is here.

We will first look at the heatmap with all metabolites. The resulting heatmap saves into the Outputs/Plots folder under the filename “AllMetabolites.pdf.”

In this chunk we complete the following steps.

  1. Define the heatmap_variables that will be used to order the heatmap

  2. Run the create_heatmap_Data function defined in the R script folder. This function is going to produce the matrices needed for the heatmap

  3. Create a palette for the heatmap. By default we are using a red/blue palette.

  4. Create the vals and labels matrices that will define the values of the heatmap and the labels of the heatmap.

  5. Display heatmap

  6. Save heatmap in the “Outputs/Plots” folder under the filename “Heatmap_all”

# 1. define meta analysis variables
heatmap_variables <-   c("GROUP_NAME",
                             "TIME1",
                             "Gender")


# 2. Create heatmap data
heatmap_data <- create_heatmap_Data(analysis_data,heatmap_variables = heatmap_variables ,
                               Gender, GROUP_NAME, desc(TIME1)) # Arranges data frame for heatmap (...)



# 3. Heat map colors 
palette <- colorRampPalette(rev(brewer.pal(10, "RdBu")))(256)



# 4. Values for heatmap
vals = heatmap_data$heatmap_data_vals

meta = heatmap_data$heatmap_variables


# 5. Create heatmap and save. 
pheatmap(vals, cluster_cols = F, cluster_rows = F, color = palette,
         annotation_col = meta, show_rownames = F, border_color = NA)

# 6. Save heatmap
pheatmap(vals, cluster_cols = F, cluster_rows = F, color = palette,
         annotation_col = meta, show_rownames = F, border_color = NA,filename = paste0("../Outputs/Plots/","Heatmap_all.pdf"))

Heatmap with top 50 metabolites

Looking at the heatmap created in the previous chunk with all of the metabolites can make it challenging to determine which line goes with which metabolite. To overcome this, we can filter the metabolites based on which metabolites have the highest mean log-scaled peak data across all observations. In the heatmap below, we select the top 50 metabolites. Once we know the top 50 metabolites with the highest mean value, we can filter the peak_data_log data only to include those metabolites. Then we can run the create_heatmap function again and create the heatmap data. To do this the following steps are completed:

  1. Define the heatmap_variables to use in the heatmap

  2. Select the top 50 metabolites from the analysis data based off of mean value.

  3. Filter the analysis_data to only include the top 50 metabolites

  4. Generate heatmap data using the create_heatmap_Data function

  5. Create a palette for the heatmap. By default, we are using a red/blue palette.

  6. Create the vals and the labels that will define the values of the heatmap and the labels of the heatmap.

  7. Display heatmap

  8. Save heatmap in the “Outputs/Plots” folder under the filename “HeatmapTop50Metabolites”

# 1. Define meta analysis variables
heatmap_variables <- c("GROUP_NAME", "TIME1", "Gender")


# 2. Find the top 50 metabolites
select_variables <- analysis_data %>% 
  select(-all_of(c("PARENT_SAMPLE_NAME",heatmap_variables))) %>%
  summarise(across(everything(),\(x) mean(x,na.rm = T))) %>%
  pivot_longer(cols = everything()) %>%
  arrange(desc(value)) %>%
  slice(c(1:50))


# 3. Filter to the top 50 metabolites
analysis_data_top50 <- analysis_data %>%
  select(all_of(c("PARENT_SAMPLE_NAME", heatmap_variables)), select_variables$name) 
  

# 4. Create heatmap data
heatmap_data <- create_heatmap_Data(analysis_data_top50,
                                    heatmap_variables = heatmap_variables ,
                               Gender, GROUP_NAME, desc(TIME1)) # Arranges data frame for heatmap (...)


# 5. Heat map colors 
palette <- colorRampPalette(rev(brewer.pal(10, "RdBu")))(256)


# 6. Values for heatmap
vals = heatmap_data$heatmap_data_vals

meta = heatmap_data$heatmap_variables


# 7. Create heatmap and save. 
pheatmap(vals, cluster_cols = F, cluster_rows = F, color = palette,
         annotation_col = meta, show_rownames = F, border_color = NA)

# 8. Save heatmap
pheatmap(vals, cluster_cols = F, cluster_rows = F, color = palette,
         annotation_col = meta, show_rownames = F, border_color = NA,filename = paste0("../Outputs/Plots/","HeatmapTop50Metabolites.pdf"))

Sub-pathway Analysis

Before diving in, lets load the matrices that we will need for the Sub-pathway Analysis section. The two matrices we will need are:

  1. Analysis data

  2. Chemical annotation data (chem_data): Contains all of the information about the metabolites, including which sub-pathway and super-pathway they belong to.

# 1. Path to Metabolon .xlsx file.
met_excel <- "../Data/Metabolon/UNAZ-0501-22VW_ DATA TABLES.xlsx"

# 2. Analysis Data
analysis_data <- read.csv("../Data/Processed/analysis_data.csv", check.names = F)


# 3. Read chemical annotations
chem_data <- read.xlsx(met_excel,sheet = "Chemical Annotation")

In the chemical annotation file, we will see that each metabolite is within a sub-pathway, and each sub-pathway is within a super-pathway. There are several metabolites within each sub-pathways and several sub-pathways within each Super-pathway. We can utilize an Analysis of variance (ANOVA) model to test for a difference in peak intensities between the treatment groups at the metabolite level, which is already part of the Metabolon analysis. However, since multiple metabolites are within a sub-pathway, it is challenging to test if the treatment affected the peak data at the sub-pathway level. For this, we will be utilizing a combined fisher test. The combined Fisher test combines the p-values for each metabolite within a sub-pathway to test the same hypothesis at the sub-pathway level.

To test our hypothesis at the sub-pathway level, we first have to form our hypothesis at the metabolite level. For each metabolite, we test three models.

1.) Interaction: \(log Peak = Treatment + block + Treatment*block\)

2.) Parallel: \(log Peak = Treatment + block\)

3.) Single: \(log Peak = Treatment\)

For the interaction model, we are focusing only on the interaction term “Treatment*block” for this to test if there is a significant interaction between our treatment and the block variable. The parallel model is testing if the block variable is explaining a significant amount of the metabolite variance, and the treatment model is testing if the treatment explains a significant proportion of the variance for each metabolite.

After running these three models for each metabolite, we will test at the sub-pathway level by combining the p-values for each metabolite within the sub-pathway for each model. We compute a chi-squared statistic to test at the sub-pathway level. For each model, we compute the chi-squared statistic by:

\[ \tilde{X} = -2\sum_{i=1}^k ln(p_i) \] where \(k\) is the number of metabablites in the sub-pathway. We can get a p-value from \(P(X \geq\tilde{X})\), knowing that \(\tilde{X}\sim \chi^2_{2k}\).

Assumptions

Since we are first testing each metabolite utilizing ANOVA, we make the following assumptions for each metabolite,

  • Independence: Each observation is independent of all other observations. Therefore, if you have collected multiple samples from the same subject then this assumption may be violated.

  • Normality: The metabolite log-scaled intensities follow a normal distribution within each of the treatment groups.

  • Homoscedasticity: Equal variance between treatment groups.

In addition to the assumptions in the ANOVA models at the metabolite level, the Fisher’s Combined probability places an independence assumptions for each metabolite being tested within the sub-pathway. For example, when we test the interaction model at the sub-pathway level, the p-values for interaction test are independent for each metabolite within the sub-pathway.

Combined Fisher Analysis

Now that we have our analysis data, we test the metabolomic data at the sub-pathway level utilizing the pathway_analysis function defined in the R script. This function takes the following arguments:

  • Ananlysis data (analysis_data): This is a a data frame created in the “Analysis data creation” module which combines the metadata and the peak data into one analysis data set.

  • Chemical Annotation data (chem_data) provided by Metabolon.

  • Treatment variable (treat_var): This is the name of the variable in the analysis data that is the main variable of interest.

  • Block Variable (block_var): Is the name of the blocking variable in the dataset. If the the experimental design does not include a blocking variable, then the value of block_var=NULL.

Note: Metabolites with a large number of missing values may have unreliable results since the missing values were imputed with the minimum of the non-missing values of the metabolite.

To run this analysis, we utilize the following steps.

  1. Run the subpathway_analysis function.

  2. Save the pathway analysis results to the “Data/Results” folder with the name “NonStratified_Full_pathway_results”

# 1. Run the ancova function
path_dat <- subpathway_analysis(analysis_data,
                            chem_data = chem_data,
                                  block_var = "TIME1",
                                  treat_var = "GROUP_NAME")




# 2. Save results
write.csv(path_dat,file = "../Data/Results/NonStratified_Full_ancova_results.csv", row.names = F)

Stratified analysis

We may notice that we need to stratify our analysis if we believe the effects of the model are different between the levels of the stratified variable. For example, we may notice that sex is a dominant source of variation in the metabolite data, and wemay want to look at males and females separately. One way to do this is to subset the data prior to running the subpathway_analysis function.

In the following chunk you can specify a variable to stratify by. For each level of the stratified variable, we:

1.) Subset the analysis data based on the stratified variable.

2.) Run the subpathay_analysis function with the subsetted data.

3.) Save the stratified data in “Data/Results” as “analysis_data_{Val}” where {Val} is the value of the stratified variable. For example, if we stratified by sex then the data will be saved as “analysis_data_female for the females.

4.) Save the results in “Data/Results” and the file will be saved as “Statified_{Val}_subpathway_results.csv” where {Val} is the value of the variable we stratified by. For example, if we stratified by sex, then the results for females would be saved as “Stratified_female_subpathway_results.csv”

Note: Metabolites with a large number of missing values may have unreliable results since the missing values were imputed with the minimum of the non-missing values of the metabolite.

# 1. Name of the variable to stratify by.
stratified_var = "Gender"


# 3. Find unique values of this variable
uni_vals <- unique(analysis_data[,stratified_var])


# For each value perform the four steps listed above.
for(i in uni_vals){
  
  # 1. Subset analysis data
  strat_data = analysis_data[analysis_data[,stratified_var]==i,]
  
  
  # 2. Run ancova function on the subsetted data
  strat_path_results <- subpathway_analysis(strat_data,
                                        chem_data = chem_data,
                                  block_var = "TIME1",
                                  treat_var = "GROUP_NAME")
  
  # 3 Save stratified data
  write.csv(strat_data, paste0("../Data/Processed/analysis_data_",i,".csv"), row.names = F)
  
  # 4 Save results
  write.csv(strat_path_results, paste0("../Data/Results/Stratified_",i,"_subpathway_results.csv"), row.names = F)
  
  
}
## Warning in anova.lm(int_mod): ANOVA F-tests on an essentially perfect fit are
## unreliable
## Warning in anova.lm(par_mod): ANOVA F-tests on an essentially perfect fit are
## unreliable
## Warning in anova.lm(treat_mod): ANOVA F-tests on an essentially perfect fit are
## unreliable
## Warning in anova.lm(int_mod): ANOVA F-tests on an essentially perfect fit are
## unreliable
## Warning in anova.lm(par_mod): ANOVA F-tests on an essentially perfect fit are
## unreliable
## Warning in anova.lm(treat_mod): ANOVA F-tests on an essentially perfect fit are
## unreliable
## Warning in anova.lm(int_mod): ANOVA F-tests on an essentially perfect fit are
## unreliable
## Warning in anova.lm(par_mod): ANOVA F-tests on an essentially perfect fit are
## unreliable
## Warning in anova.lm(treat_mod): ANOVA F-tests on an essentially perfect fit are
## unreliable
## Warning in anova.lm(int_mod): ANOVA F-tests on an essentially perfect fit are
## unreliable
## Warning in anova.lm(par_mod): ANOVA F-tests on an essentially perfect fit are
## unreliable
## Warning in anova.lm(treat_mod): ANOVA F-tests on an essentially perfect fit are
## unreliable
## Warning in anova.lm(int_mod): ANOVA F-tests on an essentially perfect fit are
## unreliable
## Warning in anova.lm(par_mod): ANOVA F-tests on an essentially perfect fit are
## unreliable
## Warning in anova.lm(treat_mod): ANOVA F-tests on an essentially perfect fit are
## unreliable

Number of significant subpathways by model type

Once we have the Overall Analysis results, we will have tested all metabolites with an interaction, parallel and Single model. We then obtain a p-values through the combined fisher analysis for each sub-pathway. In the next few chunks, we will summarize the results with a few different tables. The first summary will show the number of significant sub-pathways for each model type. To do this we:

  1. List the paths to the results files generated in the previous chunks.

  2. Provide a list to the of strata names that correspond to the path files listed in step 1. If you did not stratify your results you can call this “Non-Stratified”.

Create a for loop that does the following:

  1. Read in the results table from the path defined in step 1.

  2. Structure the names of the levels for the model types in the table.

  3. Create the table data used to count the number of significant sub-pathways for each model type.

  4. Create the table by counting the number of sub-pathways for each model type.

  5. Display the table

  6. Save the table in “Outputs/Tables” folder under the file name “NumberOfSigPathwaysByModelType_{{strata}}”. You may want to change this name to reflect the strata you are summarizing.

Note: If you only tested the sub-pathways for the Single model then only the single model is displayed.

Note: The model type for each sub-pathway is determined hiearchically. For instance, if a sub-pathway shows a significant p-value for the interaction model and the parallel model, then the sub-pathway model type is only “interaction”.

# 1. Read in Results from the analysis step. 
results_path <- c("../Data/Results/Stratified_Male_subpathway_results.csv",
                  "../Data/Results/Stratified_Female_subpathway_results.csv")


# 2. Define the names of the strata
strata <- c("Male", "Female")


for (i in 1:length(strata)) {
  
    # 3. Read in results data
    path_data <- read.csv(results_path[i], check.names = F)
    
    
    # 4. Structure levels
    if(sum(grepl("interaction",names(path_data)))==0){
      levs <- c("Single","None")
    }
    if(!sum(grepl("interaction",names(path_data)))==0){
      levs =  c("Interaction", "Parallel", "Single", "None")
    }
    
    
    # 5. Create table data
    table_data <- path_data %>%
      mutate(model = factor(model, levels = levs)) %>%
      select(sub_pathway, ends_with("_fisher"), model) %>%
      distinct()
    
    
    # 6. Create table
     sig_subPaths <- count(table_data, model) %>%
                        arrange((model)) %>%
                        flextable() %>%
                       set_header_labels(model = "Model Type", n="Count")%>%
                       theme_vanilla() %>%
                       set_table_properties(layout = "autofit") %>%
                      set_caption(caption = paste0("Sigificant Pathways by Model (",
                                                   strata[i],")."))
     
     
     # 7. Display table
     cat(knitr::knit_print(sig_subPaths))
    
     
     # 8. Save table
     save_as_docx(sig_subPaths,path=paste0(
       "../Outputs/Tables/NumberOfSigPathwaysByModelType_",strata[i],".docx"))
}     
Sigificant Pathways by Model (Male).

Model Type

Count

Interaction

84

Parallel

12

Single

2

None

12

Sigificant Pathways by Model (Female).

Model Type

Count

Interaction

6

Parallel

20

Single

19

None

65

Proportion of significant subpathways within super pathways:

Additionally, we can summarize the results by looking at the number of significant sub-pathways within a super-pathway for each model type. This summary is done by:

  1. List the paths to the results files generated in the previous chunks.

  2. Provide a list to the of strata names that correspond to the path files listed in step 1. If you did not stratify your results you can call this “Non-Stratified”.

Create a for loop that does the following:

  1. Read the results for the given strata.

  2. Structure the levels for the model types based on the types of model ran.

  3. Create the table data that has the combined fisher p-values for all models and the model type.

  4. Formulating the superpath data by:

  • Filtering all subpathways that did not have a significant subpathway.

  • Join the results with the chemical annoation data to group by the superpathway.

  • Summarize the percent of subpathways within superpathways that are significant.

  1. Display table

  2. Save table in the “Outputs/Tables” folder under the name “SigSuperPathwayPecentages_{{strata}}”

Note: The model type for each sub-pathway is determined hierarchically. For instance, if a sub-pathway shows a significant p-value for the interaction model and the parallel model, then the sub-pathway model type is only “interaction”.

# 1. Read in Results from the analysis step. 
results_path <- c("../Data/Results/Stratified_Male_subpathway_results.csv",
                  "../Data/Results/Stratified_Female_subpathway_results.csv")


# 2. Define the names of the strata
strata <- c("Male", "Female")


for(i in 1:length(strata)){
  
  # 3 Read results for the strata
  path_data <- read.csv(results_path[i], check.names = F)
  
  
  # 4. Structure levels
    if(sum(grepl("interaction",names(path_data)))==0){
      levs <- c("Single","None")
      
      cases <- expression( case_when(model == "Single" ~ single_fisher))
    }
    if(!sum(grepl("interaction",names(path_data)))==0){
      levs =  c("Interaction", "Parallel", "Single", "None")
      
      cases <- expression( case_when(model == "Interaction" ~ interaction_fisher,
                                          model == "Parallel" ~ parallel_fisher,
                                          model == "Single" ~ single_fisher))
    }
  
  
   # 5. Create table data
  table_data <- path_data %>%
    mutate(model = factor(model, levels = levs)) %>%
    select(sub_pathway, ends_with("_fisher"), model) %>%
    distinct()
  
  
    # 6. Formulate the Super-pathway results table.
  superPath <- table_data %>% 
                  filter(model != "None") %>%
                  mutate(pval = eval(cases)) %>%
                  select(sub_pathway, pval) %>%
                  right_join(chem_data %>% select(SUPER_PATHWAY, SUB_PATHWAY) %>% distinct(),
                            by = c("sub_pathway" = "SUB_PATHWAY"))  %>%
                  filter(!is.na(sub_pathway)) %>%
                  mutate(sig = ifelse(is.na(pval), 0, 1)) %>%
                  group_by(SUPER_PATHWAY) %>%
                  summarise(percent_significant = round(mean(sig) * 100, 2),
                            number_significant = sum(sig),
                            pathway_count = n()) %>%
                  ungroup() %>% arrange(-percent_significant) %>%
                  transmute(SUPER_PATHWAY, percent_significant = paste0(number_significant, " / ", pathway_count, 
                                                                        " (", percent_significant, "%)")) %>%
                  flextable() %>%
                  set_header_labels(SUPER_PATHWAY="Super Pathway", percent_significant = "Percent Significant") %>%
                  set_table_properties(layout = "autofit") %>%
                  set_caption(caption = paste0("Proportion of significant subpathways within super-pathways (",
                                                   strata[i],").")) %>%
                  theme_vanilla()
  
  
  # 7. Display table
  cat(knitr::knit_print(superPath))
  
  
  # 8. Save table
  save_as_docx(superPath,path = paste0("../Outputs/Tables/SigSuperPathwayPecentages_",strata[i],".docx"))
}     
Proportion of significant subpathways within super-pathways (Male).

Super Pathway

Percent Significant

Amino Acid

16 / 16 (100%)

Energy

2 / 2 (100%)

Nucleotide

8 / 8 (100%)

Partially Characterized Molecules

1 / 1 (100%)

Xenobiotics

5 / 5 (100%)

Carbohydrate

7 / 8 (87.5%)

Lipid

46 / 53 (86.79%)

Cofactors and Vitamins

9 / 11 (81.82%)

Peptide

3 / 5 (60%)

Proportion of significant subpathways within super-pathways (Female).

Super Pathway

Percent Significant

Xenobiotics

5 / 5 (100%)

Amino Acid

13 / 16 (81.25%)

Peptide

3 / 5 (60%)

Nucleotide

3 / 8 (37.5%)

Lipid

16 / 53 (30.19%)

Carbohydrate

2 / 8 (25%)

Cofactors and Vitamins

2 / 11 (18.18%)

Energy

0 / 2 (0%)

Partially Characterized Molecules

0 / 1 (0%)

List of subpathways that were found to have significant Fisher method p-values.

The next table we create will show all of the significant sub-pathways and their model type. We do this by.

  1. List the paths to the results files generated in the previous chunks.

  2. Provide a list to the of strata names that correspond to the path files listed in step 1. If you did not stratify your results you can call this “Non-Stratified”.

Create a for-loop that does the following:

  1. Reads in the results from the path listed

  2. Structure model type levels.

  3. Create pathway data that contains the sub-pathway, model type and the p-value for the model type with the the pathways that were non-significant filtered out.

  4. Format table

  5. Display table

  6. Save table in the “Outputs/Tables” folder with the name “SigSubpathwayTable_{{strata}}”

# 1. Read in Results from the analysis step. 
results_path <- c("../Data/Results/Stratified_Male_subpathway_results.csv",
                  "../Data/Results/Stratified_Female_subpathway_results.csv")


# 2. Define the names of the strata
strata <- c("Male", "Female")


# Create for loop
for (i in 1:length(strata)){
  
  
  # 3 Read Path data
  path_data <- read.csv(results_path[i])
  
  
  # 4. Structure levels
  if(sum(grepl("interaction",names(path_data)))==0){
    levs <- c("Single","None")
    
    cases <- expression( case_when(model == "Single" ~ single_fisher))
  }
  if(!sum(grepl("interaction",names(path_data)))==0){
    levs =  c("Interaction", "Parallel", "Single", "None")
    
    cases <- expression( case_when(model == "Interaction" ~ interaction_fisher,
                                        model == "Parallel" ~ parallel_fisher,
                                        model == "Single" ~ single_fisher))
  }

  

  # 5. Create pathway table
  pathway_table <- path_data %>%
    mutate(model = factor(model, levels = levs)) %>%
    select(sub_pathway, ends_with("_fisher"), model) %>%
    distinct() %>% 
    filter(model != "None") %>%
    mutate(pval = eval(cases)) %>%
    select(sub_pathway, model, pval) %>%
    arrange(model, pval)
  
  
  # 6. Format table
  path_tab = pathway_table %>%
    flextable() %>%
    set_header_labels(sub_pathway="Subpathway", model = "Model Type", pval = "P-value") %>%
    set_table_properties(layout = "autofit") %>%
    colformat_double(digits = 3) %>%
    theme_vanilla() %>%
    set_caption(caption = paste0("Significant Subpathways (",
                                                   strata[i],")."))
  
  
  # 7. Display table
  cat(knitr::knit_print(path_tab))
  
  
  # Save table
  save_as_docx(path = paste0("../Outputs/Tables/SigSubpathwayTable_",strata[i],".docx"))
}
Significant Subpathways (Male).

Subpathway

Model Type

P-value

Glutamate Metabolism

Interaction

0.000

Polyamine Metabolism

Interaction

0.000

Nicotinate and Nicotinamide Metabolism

Interaction

0.000

Fatty Acid, Dihydroxy

Interaction

0.000

Tryptophan Metabolism

Interaction

0.000

Lysine Metabolism

Interaction

0.000

TCA Cycle

Interaction

0.000

Purine Metabolism, Guanine containing

Interaction

0.000

Leucine, Isoleucine and Valine Metabolism

Interaction

0.000

Glycolysis, Gluconeogenesis, and Pyruvate Metabolism

Interaction

0.000

Primary Bile Acid Metabolism

Interaction

0.000

Purine Metabolism, (Hypo)Xanthine/Inosine containing

Interaction

0.000

Long Chain Polyunsaturated Fatty Acid (n3 and n6)

Interaction

0.000

Medium Chain Fatty Acid

Interaction

0.000

Methionine, Cysteine, SAM and Taurine Metabolism

Interaction

0.000

Chemical

Interaction

0.000

Purine Metabolism, Adenine containing

Interaction

0.000

Urea cycle; Arginine and Proline Metabolism

Interaction

0.000

Alanine and Aspartate Metabolism

Interaction

0.000

Sterol

Interaction

0.000

Phospholipid Metabolism

Interaction

0.000

Creatine Metabolism

Interaction

0.000

Secondary Bile Acid Metabolism

Interaction

0.000

Sphingolipid Synthesis

Interaction

0.000

Gamma-glutamyl Amino Acid

Interaction

0.000

Food Component/Plant

Interaction

0.000

Fatty Acid, Dicarboxylate

Interaction

0.000

Long Chain Saturated Fatty Acid

Interaction

0.000

Pyrimidine Metabolism, Orotate containing

Interaction

0.000

Long Chain Monounsaturated Fatty Acid

Interaction

0.000

Vitamin A Metabolism

Interaction

0.000

Pyrimidine Metabolism, Uracil containing

Interaction

0.000

Pentose Metabolism

Interaction

0.000

Pyrimidine Metabolism, Cytidine containing

Interaction

0.000

Tocopherol Metabolism

Interaction

0.000

Endocannabinoid

Interaction

0.000

Aminosugar Metabolism

Interaction

0.000

Fatty Acid, Monohydroxy

Interaction

0.000

Glycerolipid Metabolism

Interaction

0.000

Ceramides

Interaction

0.000

Phosphatidylethanolamine (PE)

Interaction

0.000

Phosphatidylinositol (PI)

Interaction

0.000

Phosphatidylcholine (PC)

Interaction

0.000

Sphingomyelins

Interaction

0.000

Dipeptide

Interaction

0.000

Monoacylglycerol

Interaction

0.000

Lysoplasmalogen

Interaction

0.000

Lysophospholipid

Interaction

0.000

Fatty Acid Metabolism (Acyl Carnitine, Long Chain Saturated)

Interaction

0.000

Fatty Acid Metabolism (Acyl Carnitine, Medium Chain)

Interaction

0.000

Ascorbate and Aldarate Metabolism

Interaction

0.000

Fatty Acid Metabolism (Acyl Glycine)

Interaction

0.000

Fatty Acid Metabolism (Acyl Carnitine, Monounsaturated)

Interaction

0.000

Partially Characterized Molecules

Interaction

0.000

Hexosylceramides (HCER)

Interaction

0.000

Fatty Acid, Branched

Interaction

0.000

Diacylglycerol

Interaction

0.000

Fatty Acid Metabolism (Acyl Carnitine, Polyunsaturated)

Interaction

0.000

Fatty Acid Metabolism (Acyl Carnitine, Hydroxy)

Interaction

0.000

Fatty Acid, Amino

Interaction

0.000

Fatty Acid Metabolism (Acyl Carnitine, Dicarboxylate)

Interaction

0.000

Plasmalogen

Interaction

0.000

Dihydrosphingomyelins

Interaction

0.000

Lactoyl Amino Acid

Interaction

0.000

Unknown Metabolite

Interaction

0.000

Pentose Phosphate Pathway

Interaction

0.000

Glycosyl PE

Interaction

0.002

Tyrosine Metabolism

Interaction

0.010

Histidine Metabolism

Interaction

0.010

Hemoglobin and Porphyrin Metabolism

Interaction

0.010

Sphingosines

Interaction

0.010

Glycine, Serine and Threonine Metabolism

Interaction

0.010

Glutathione Metabolism

Interaction

0.010

Vitamin B6 Metabolism

Interaction

0.010

Thiamine Metabolism

Interaction

0.010

Fructose, Mannose and Galactose Metabolism

Interaction

0.010

Phenylalanine Metabolism

Interaction

0.020

Fatty Acid Metabolism (also BCAA Metabolism)

Interaction

0.020

Disaccharides and Oligosaccharides

Interaction

0.033

Ketone Bodies

Interaction

0.034

Mevalonate Metabolism

Interaction

0.040

Riboflavin Metabolism

Interaction

0.040

Tetrahydrobiopterin Metabolism

Interaction

0.040

Guanidino and Acetamido Metabolism

Interaction

0.040

Drug - Topical Agents

Parallel

0.000

Pyrimidine Metabolism, Thymine containing

Parallel

0.000

Dihydroceramides

Parallel

0.000

Eicosanoid

Parallel

0.000

Benzoate Metabolism

Parallel

0.000

Phosphatidylserine (PS)

Parallel

0.000

Dipeptide Derivative

Parallel

0.000

Docosanoid

Parallel

0.001

Glycogen Metabolism

Parallel

0.002

Purine and Pyrimidine Metabolism

Parallel

0.008

Inositol Metabolism

Parallel

0.010

Drug - Other

Parallel

0.040

Phosphatidylglycerol (PG)

Single

0.001

Oxidative Phosphorylation

Single

0.026

Significant Subpathways (Female).

Subpathway

Model Type

P-value

Purine Metabolism, (Hypo)Xanthine/Inosine containing

Interaction

0.000

Pentose Phosphate Pathway

Interaction

0.004

Glycogen Metabolism

Interaction

0.011

Food Component/Plant

Interaction

0.020

Lactoyl Amino Acid

Interaction

0.040

Drug - Other

Interaction

0.040

Nicotinate and Nicotinamide Metabolism

Parallel

0.000

Primary Bile Acid Metabolism

Parallel

0.000

Chemical

Parallel

0.000

Histidine Metabolism

Parallel

0.000

Hemoglobin and Porphyrin Metabolism

Parallel

0.000

Secondary Bile Acid Metabolism

Parallel

0.000

Sphingolipid Synthesis

Parallel

0.000

Eicosanoid

Parallel

0.000

Benzoate Metabolism

Parallel

0.000

Acetylated Peptides

Parallel

0.000

Phosphatidylserine (PS)

Parallel

0.000

Hexosylceramides (HCER)

Parallel

0.000

Diacylglycerol

Parallel

0.000

Unknown Metabolite

Parallel

0.000

Drug - Topical Agents

Parallel

0.010

Polyamine Metabolism

Parallel

0.020

Phospholipid Metabolism

Parallel

0.030

Glutamate Metabolism

Parallel

0.040

Urea cycle; Arginine and Proline Metabolism

Parallel

0.040

Glycine, Serine and Threonine Metabolism

Parallel

0.040

Tryptophan Metabolism

Single

0.000

Lysine Metabolism

Single

0.000

Leucine, Isoleucine and Valine Metabolism

Single

0.000

Phenylalanine Metabolism

Single

0.000

Methionine, Cysteine, SAM and Taurine Metabolism

Single

0.000

Purine Metabolism, Adenine containing

Single

0.000

Tyrosine Metabolism

Single

0.000

Fatty Acid, Dicarboxylate

Single

0.000

Fatty Acid Metabolism (also BCAA Metabolism)

Single

0.000

Dipeptide

Single

0.000

Lysoplasmalogen

Single

0.000

Plasmalogen

Single

0.000

Inositol Metabolism

Single

0.010

Pyrimidine Metabolism, Uracil containing

Single

0.010

Docosanoid

Single

0.015

Guanidino and Acetamido Metabolism

Single

0.020

Lysophospholipid

Single

0.020

Dipeptide Derivative

Single

0.030

Fatty Acid Metabolism (Acyl Glycine)

Single

0.040

Metabolites within significant subpathways

Now we will looks at the metabolites within significant sub-pathways, and the p-values for the significant model. In this section we will look at all of the metabolites within the top two sub-pathways based on p-values. To do this we will:

  1. List the paths to the results files generated from the sub-pathway analysis.

  2. Provide a list to the of strata names that correspond to the path files listed in step 1. If you did not stratify your results you can call this “Non-Stratified”.

  3. Name the model type that we are interested in looking at. The options are “Interaction”, “Parallel” or “Single”. This is case sensitive and must match the options.

Create a for-loop that does the following:

  1. Read in the sub-pathways analysis results for the given strata.

  2. Structure the levels for the model type in the results.

  3. Create the data used for the tables which includes sub-pathway, the fisher p-values, and the model type for the sub-pathway.

  4. Create the top sub-pathways data which will group by model type and order by the sub-pathways based on p-value. In this code we are only going to focus on the top two sub-pathways. If you want to focus on more sub-pathways you can edit the “slice” function.

  5. Create a list of the top two sub-pathways for the specified model type.

Create a second for loop which:

  1. Create the table which shows all the metabolite p-values for the model type.

  2. Diplay the table

  3. Save the table in “Outputs/Tables” with the file name “Metabolite_model_pvalues_{{pathway}}_{{strata}}”

# 1. Read in Results from the analysis step. 
results_path <- c("../Data/Results/Stratified_Male_subpathway_results.csv",
                  "../Data/Results/Stratified_Female_subpathway_results.csv")


# 2. Define the names of the strata
strata <- c("Male", "Female")


# 3 Model type (Interaction, Parallel, Single)
mod = "Interaction"


# Create for loop
for(i in 1:length(strata)){
  # 4. Read in table data and results from overall analysis. 
  path_data <- read.csv(results_path[i], check.names = F)


 
  # 5. Structure levels
  if(sum(grepl("interaction",names(path_data)))==0){
    levs <- c("Single","None")
    
    cases <- expression( case_when(model == "Single" ~ single_fisher))
  }
  if(!sum(grepl("interaction",names(path_data)))==0){
    levs =  c("Interaction", "Parallel", "Single", "None")
    
    cases <- expression( case_when(model == "Interaction" ~ interaction_fisher,
                                        model == "Parallel" ~ parallel_fisher,
                                        model == "Single" ~ single_fisher))
  }


   # 6. Create table data
  table_data <- path_data %>%
    mutate(model = factor(model, levels = levs)) %>%
    select(sub_pathway, ends_with("_fisher"), model) %>%
    distinct()



  # 7. Create top pathways table 
  top_pathway_table <- table_data %>% 
    filter(model != "None") %>%
    mutate(pval = eval(cases)) %>%
    select(sub_pathway, model, pval) %>%
    arrange(model, pval) %>%
    filter(model == mod) %>%
    group_by(model) %>% slice(1:2) %>% ungroup() %>%
    left_join(select(path_data, sub_pathway, chem_name, ends_with("_pval")),
              by = "sub_pathway") %>%
    distinct()


  # 8. list the top two subpathways. 
  pathways <- unique(top_pathway_table$sub_pathway)
  
    # Create pathways for look
    for(j in 1:length(pathways)){
      
      
      # 9. Create table for specific model type
      top_paths_table <- top_pathway_table %>% 
            filter(sub_pathway == pathways[j]) %>%
            select(chem_name, all_of(paste0(tolower(mod),"_pval"))) %>%
            arrange(all_of(paste0(tolower(mod),"p_val"))) %>%
            flextable() %>%
            set_header_labels(chem_name="Metabolite") %>%
            set_table_properties(layout = "autofit") %>%
            add_header_row(values = c(pathways[j]),colwidths = 2) %>%
            theme_vanilla() %>%
            align(i=1,align = "center", part = "header") %>%
            colformat_double(digits = 3) %>%
            set_caption(caption = paste0("Metabolites Within Pathways of Significant ",mod, " Model(",
                                                   strata[i],")."))
    
      # 10. Display table 
      cat(knitr::knit_print(top_paths_table))
      
      
      # 11. Save table 
    save_as_docx(top_pathway_table, 
                 path = paste0("../Outputs/Tables/Metabolite_model_pvalues_",gsub("[^A-Za-z0-9 ]","",pathways[j]),"_",strata[i],".docx"))
      
    }
}  
## Warning: There was 1 warning in `arrange()`.
## ℹ In argument: `..1 = all_of(paste0(tolower(mod), "p_val"))`.
## Caused by warning:
## ! Using `all_of()` outside of a selecting function was deprecated in tidyselect
##   1.2.0.
## ℹ See details at
##   <https://tidyselect.r-lib.org/reference/faq-selection-context.html>
Metabolites Within Pathways of Significant Interaction Model(Male).

Glutamate Metabolism

Metabolite

interaction_pval

S-1-pyrroline-5-carboxylate

0.356

glutamate

0.233

glutamine

0.023

N-acetylglutamate

0.091

N-acetylglutamine

0.091

N-acetyl-aspartyl-glutamate (NAAG)

0.103

alpha-ketoglutaramate*

0.075

4-hydroxyglutamate

0.248

N-methyl-GABA

0.076

carboxyethyl-GABA

0.061

Metabolites Within Pathways of Significant Interaction Model(Male).

Polyamine Metabolism

Metabolite

interaction_pval

putrescine

0.181

spermidine

0.061

N-acetylputrescine

0.042

5-methylthioadenosine (MTA)

0.131

spermine

0.091

4-acetamidobutanoate

0.282

(N(1) + N(8))-acetylspermidine

0.034

Metabolites Within Pathways of Significant Interaction Model(Female).

Purine Metabolism, (Hypo)Xanthine/Inosine containing

Metabolite

interaction_pval

hypoxanthine

0.006

allantoic acid

0.250

inosine

0.001

inosine 5'-monophosphate (IMP)

0.859

allantoin

0.316

xanthine

0.021

urate

0.667

xanthosine

0.961

N1-methylinosine

0.941

Metabolites Within Pathways of Significant Interaction Model(Female).

Pentose Phosphate Pathway

Metabolite

interaction_pval

ribose 1-phosphate

0.004

Pairwise Compairisons

At the metabolite level, we can look at the pairwise comparisons for all experimental groups. In the following section we will look at the pairwise comparisons using the metabolite_pairwise function defined in the R scripts. We will first read in the analysis data needed for this analysis. Since the column names of the analysis data are numeric, we should convert all of the column names to to characters.

# 1. Read in data from the analysis above
analysis_data <- read.csv("../Data/Processed/analysis_data.csv", check.names = F)

# 2. Make the variable names charactors
names(analysis_data) <- as.character(names(analysis_data))

Once we have the analysis data loaded in, we can now look at the pairwise comparisons for each of our experimental groups. To do this we will be using the metabolite_pairwise function defined in the R scripts. This function will analyze each metabolite individually. For each metabolite, the the metabolite_pairwise function will first test if any of the experimental groups explained a significant proportion of the variance in the metabolite using an F-test. Since we will be looking at multiple comparisons for the metabolite it is good practice to first look at the over-all p-value from the F-test before looking at the pairwise comparisons.

The metabolite_pairwise function then looks at all pairwise comparisons utilizing the emmeans package. The metabolite_pairwise function returns a data frame with the metabolite, overall p-value, estimated difference in means for all experimental groups, and the p-value which test if the difference the groups is significantly different.

The metabolite_pairwise function has three arguments:

In the next chunk we will stratify the analysis by running the metabolite_pairwise function for each stata in our analysis data set. To do this we:

  1. Define the variable in the analysis data that we want to stratify our analysis by.

  2. Define the form variable for the metabolite_pairwise function.

  3. Create a list of metabolites that we want to pairwise analysis for. By default, we are going to use all metabolites in the analysis data.

Create a for look that:

  1. Filters the analysis_data to focus on a single strata.

  2. Run the metabolite_pairwise function for that strata.

  3. Save results in the “../Data/Results/Pairwise_Comparisons” folder with the file name “Pairwise_{{strata}}.

# 1. Define the variable in the analysis data that we want to stratify 
strata_var <- "Gender"


# 2. Define the form variable
form <- "GROUP_NAME*TIME1"


# Create a list of metabolites
analysis_variables <- c("PARENT_SAMPLE_NAME", "GROUP_NAME",
                                           "TIME1", "Gender")
metabolites <- names(analysis_data)[!names(analysis_data) %in% analysis_variables ]

stratas <- unique(analysis_data[,strata_var])
for (i in 1:length(stratas)) {
  
  # 4. Filter analysis data
  dat <- analysis_data[analysis_data[,strata_var]==stratas[i],]
  
  
  # 5. Run the metabolite_pairwise function for the strata
  strat_results <- metabolite_pairwise(form = form,
                                       data=dat,
                                       metabolites = metabolites)
  
  # 6. Save results
  write.csv(strat_results, file = paste0("../Data/Results/Pairwise_Comparisons/Pairwise_",
                                         stratas[i],".csv"), row.names = F)
  
  }
## ================================================================================
## ================================================================================

P-Value Heatmaps

We will visualize the p-values in a heatmap by having the metabolites in the row and the pairwise comparisons in the columns. The values that make up the heatmap will be the -log10 transformed p-value. This will allow the smaller pvalue to have a higher intensity in the heatmap. To generate the heatmap, the following code uses plotly which gives us the capability to make interactive plots. The heatmap that is generated will allow us to hover over the heatmap and see information about the metabolite including the metabolite name and the sub-pathway and the p-value from the overall F-test. Additionally, we group the metabolites by sub-pathway to make it easier to see if multiple metabolites within a sub-pathway are significant. To do this we:

  1. Read in the results from the pairwise comparisons. If you stratified the analysis, you may want to copy this chunk and rerun for each strata. Additionally, we will need to read in the chemical annotation file.

  2. Merge the chemical annotation fill with the results from the pairwise comparisons.

  3. Create the heatmap data that contains only the p-values for each pairwise comparison. Additionally, we -log10 transform all of the p-values.

  4. Create a matrix of “hover data” which will be what is displayed when we hover over the heatmap. Here we are choosing to display metabolite name and sub-pathway.

  5. Generate heatmap using plotly

  6. Display heatmap

  7. Save heatmap in “…/Outputs/Plots” with the file name “PaiwiseComp_pvalue_heatmap”. Since this plot is interactive, it will be save as an html file that can be opened in a web browser.

# 1. Read in pairwise results
results_pair <-read.csv("../Data/Results/Pairwise_Comparisons/Pairwise_Female.csv", check.names = F)
 
# Read chemical annotation file
met_excel <- "../Data/Metabolon/UNAZ-0501-22VW_ DATA TABLES.xlsx"
chem_data <- read.xlsx(met_excel,sheet = "Chemical Annotation")

 

# 2. Merge the chemical annotation fill with the results from the pairwise comparisons.
data <- chem_data %>%
  select(SUB_PATHWAY,CHEMICAL_NAME,CHEM_ID) %>%
  merge(results_pair, by.x = "CHEM_ID",by.y = "Metabolite") %>%
  arrange(SUB_PATHWAY)
  

# 3. Create the heatmap data
heatmap_dat <- apply(data[,grepl("PVALS",names(data))],MARGIN = 2,
                     FUN = function(x){-log10(x)})

# 4. Create a matrix of "hover data"
hover_text <- matrix(rep(paste0("Subpathway: ",data$SUB_PATHWAY, ", ",
                                "Metabolite: ", data$CHEMICAL_NAME,
                                ", Overall P-value: ", round(data$Overall_pval,3)),
                         ncol(heatmap_dat)),nrow=length(data$SUB_PATHWAY))


# 5. Generate heatmap using plotly

p <- plot_ly(x=colnames(heatmap_dat), y=as.character(data$CHEMICAL_NAME), 
            z =as.matrix( heatmap_dat), 
            type = "heatmap",
            showscale = T, 
            hoverinfo = 'text',
            text = hover_text,
            xaxis = list(ntick= ncol(heatmap_dat))) %>%
  layout(title = "P-value Pairwise Comparisons Heatmap")


# 6. Display heatmap
p
# 7. Save.
#htmlwidgets::saveWidget(p,paste0("../Outputs/Plots/PaiwiseComp_pvalue_heatmap_",".html"))

Additionally we can create a heatmap which focuses on the metabolites that had a significant pairwise comparison. This heatmap is useful because we can condition the heatmap to show which metabolites had a significant amount of variance explained by the treatment groups prior to looking at the pairwise comparisons. For each metabolite, if the overall pvalue from the F statistic was less than 0.05 AND the pairwise comparison p-value was less than 0.05, then the heatmap value is set to 1 and 0 otherwise. In the following code we:

  1. Read in the results from the pairwise comparisons. If you stratified the analysis, you may want to copy this chunk and rerun for each strata. Additionally, we will need to read in the chemical annotation file.

  2. Merge the chemical annotation fill with the results from the pairwise comparisons. In this step we group the metabolites by pathway.

  3. Create the heatmap data where the value is 1 if the overall p-value is less than 0.05 and the pairwise comparison p-value is less than 0.05.

  4. Create a matrix of “hover data” which will be what is displayed when we hover over the heatmap. Here we are choosing to display metabolite name and sub-pathway.

  5. Generate heatmap using plotly

  6. Display heatmap

  7. Save heatmap in “…/Outputs/Plots” with the file name “SigPaiwiseComp_pvalue_heatmap”. Since this plot is interactive, it will be save as an html file that can be opened in a web browser.

# 1. Read in pairwise results
results_pair <-read.csv("../Data/Results/Pairwise_Comparisons/Pairwise_Female.csv", check.names = F)
  
# Read chemical annotation file
met_excel <- "../Data/Metabolon/UNAZ-0501-22VW_ DATA TABLES.xlsx"
chem_data <- read.xlsx(met_excel,sheet = "Chemical Annotation")
  

# 2. Merge the chemical annotation fill with the results from the pairwise comparisons.
pval_col_names <- names(results_pair)[grepl("PVALS",as.character(names(results_pair)))]
data <- chem_data %>%
  select(SUB_PATHWAY,CHEMICAL_NAME,CHEM_ID) %>%
  merge(results_pair, by.x = "CHEM_ID",by.y = "Metabolite") %>%
  arrange(SUB_PATHWAY)
  

# 3. Create the heatmap data
heatmap_dat <- data %>%
  mutate(across(all_of(pval_col_names),~ ifelse(.x <0.05& Overall_pval<0.05,1,0))) %>%
  select(all_of(pval_col_names))

# 4. Create a matrix of "hover data"
hover_text <- matrix(rep(paste0("Subpathway: ",data$SUB_PATHWAY, ", ",
                                "Metabolite: ", data$CHEMICAL_NAME,
                                ", Overall P-value: ", round(data$Overall_pval,3)),
                         ncol(heatmap_dat)),nrow=length(data$SUB_PATHWAY))


# 5. Generate heatmap using plotly
p <- plot_ly(x=colnames(heatmap_dat), y=as.character(data$CHEMICAL_NAME), 
            z =as.matrix( heatmap_dat), 
            type = "heatmap",
            showscale = T, 
            hoverinfo = 'text',
            text = hover_text,
            showscale=F,
            xaxis = list(ntick= ncol(heatmap_dat))) %>%
  layout(title = "Significant Pairwise Comparisons Heatmap")


# 6. Display heatmap
p
# 7. Save heatmap
#htmlwidgets::saveWidget(p,paste0("../Outputs/Plots/SigPaiwiseComp_pvalue_heatmap_"
 #                                ,".html"))

Box Plots and Line Plots

Visualizations of the data can help us see the underlying trends. Two useful visualization is boxplots and line graphs. Metabolon provides customers with similar box plots, but the following will allow to look at the boxplots specifically with a sub-pathway.

Boxplots

Suppose there is a specific subpathway that we would like to visualize. In that case, we can compare the treatment groups for each metabolite within that sub-pathway using the subpathway_box plots function defined in the R script. T his function takes the following arguments.

  • Analysis data (analysis_data): This is the analysis data used for the modeling in the primary analysis. If you followed this pipeline all the way through, the analysis data is saved in the Data/Processed folder.

  • Chemical annotation data (chem_data): The chemical annotation data. This argument is necessary to link the columns of the analysis data to the sub pathways. If you followed this pipeline all the way through, the analysis data is saved in the Data/Processed folder.

  • Pathway (pathway): This is the name of the sub-pathway of interest.

  • X: This the the name of the variable in the meta data that is used for the X axis of the box plots.

  • Group By (groupBy): This is a grouping variable. As a recommendation the treatment groups should be used in the groupBy argument as this will provide a different color for each of the treatments making it easier to identify.

  • … : At the end of this function you can provide additional arguments for filtering the analysis data. In the examples below we are filtering so the box plots only focus on males and only two treatment groups.

The output of the subpathway_boxplot function is box plots for each of the metabolites within the specified subpathway. The labels for the box plots are default based on the data created inside the subpathway_boxplots function. Since these box plots utilize ggplot, we can customize all labels using the. labs function from ggplot.

Box plots example

Here we look at the boxplots for each metabolite within a specified subpathway utilizing the subpathway_boxplots function. The subpathway_boxplots function takes the following arguments

# 1. Read in analysis data
analysis_data <- read.csv("../Data/Processed/analysis_data.csv", check.names = F)

# 2. Reach in chemical annotation data
# Read chemical annotation file
met_excel <- "../Data/Metabolon/UNAZ-0501-22VW_ DATA TABLES.xlsx"
chem_data <- read.xlsx(met_excel,sheet = "Chemical Annotation")
  
# 3. Specify subpathway of interest. 
subpath = "Gamma-glutamyl Amino Acid"


# 4. Run the subpathway_boxplots function
box_1 <- subpathway_boxplots(analysis_data =  analysis_data,
                chem_data = chem_data,
                subpathway = subpath,
                X = TIME1,
                groupBy = GROUP_NAME,
                # Additional analysis data filtering options. 
                Gender == "Male", GROUP_NAME != "Control")


# 5. Customize labels for the boxplots
box_1 +
  labs(x = "Time", y="Scaled Intensity", color="Treatment Group",
       title = "Metabolite boxplots for the Gamma-glutamyl Amino Acid Subpathway")

# 6. Save boxplots. You can change the file type for example change pdf to png. 
ggsave(filename = paste0("../Outputs/Plots/Metabolite_boxplots_",subpath,".pdf"))
## Saving 10 x 10 in image

Line plots

Another useful visualization is line plots. These plots are useful for seeing trends between an ordered variable and the scaled intensity. Line plots can be useful for analyzing trends across the different treatment groups. Like the box plots, we are interested in seeing trends for all metabolites within a sub-pathway. To do this, we can utilize the “subpathway_lineplots” function defined in the R scripts. This function takes the following arguments.

  • Analysis data (analysis_data): This is the analysis data used for the modeling in the primary analysis. If you followed this pipeline all the way through, the analysis data is saved in the Data/Processed folder.

  • Chemical annotation data (chem_data): The chemical annotation data. This argument is necessary to link the columns of the analysis data to the subpathways. If you followed this pipeline all the way through, the analysis data is saved in the Data/Processed folder.

  • Pathway (pathway): This is the name of the subpathway of interest.

  • X: This the name of the variable in the metadata that is used for the X axis of the line plots.

  • Group By (groupBy): This is a grouping variable. As a recommendation the treatment groups should be used in the groupBy argument as this will provide a different color for each of the treatments making it easier to identify.

Line plot example

Below is a demonstration of utilizing the line plots to see trends between treatment groups at the metabolite level. In the chunk below we:

  1. Read in the analysis data

  2. Read in the chemical annotation data

  3. Prep data for line plots. To produce the lines we must provide a numeric variable for the x-axis. If you are looking at trends across an ordinal categorical variable you will need to convert the variable to be numeric. For example, if we want to see the trend in time between treatment groups, where time takes the ordinal values of “Pre-symptomatic”, “onset of symptoms”, and “end of symptoms”, then we will need to assign these names to the values of 1,2, and 3 respectively. Additionally, we will need to make any other desired adjustments to our data at this step.

  4. Define the sub-pathway of interest.

  5. Run the “subpathway_lineplots function”

  6. Edit the aesthetics of plots. Similar to the box plots, we are using the labs function to edit the labels for the plots. Additionally, we are utilizing the scale_x_continuous function at add labels to the tick marks for the plot that correspond with the original values.

  7. Save the plots in the folder “Outputs/Plots” under the file name “linePlots_{{subpathway}} where {{subpathway}} is the name of the subpathway defined in step 4.

# 1. Read in analysis data
analysis_data <- read.csv("../Data/Processed/analysis_data.csv", check.names = F)

# 2. Reach in chemical annotation data
met_excel <- "../Data/Metabolon/UNAZ-0501-22VW_ DATA TABLES.xlsx"
chem_data <- read.xlsx(met_excel,sheet = "Chemical Annotation")


# 3. Prep data for line plots
analysis_data <- analysis_data %>%
  filter(Gender=="Male") %>%
  mutate(TIME1 = case_when(TIME1== "PreSymp" ~ 1,
                           TIME1 =="Onset" ~ 2,
                       TIME1 == "End" ~ 3))



# 4. Define subpathway of interest. 
subpathway = "Gamma-glutamyl Amino Acid"


# 5. Run subpathway_line plots function.
line_plots <- subpathway_lineplots(analysis_data = analysis_data,
                     chem_data = chem_data,
                     subpathway = subpathway,
                     X = TIME1,
                     groupBy = GROUP_NAME)



# 6. Edit asthetics of plots
line_plots +
  scale_x_continuous(breaks = c(1, 2, 3), labels = c("PreSymp", "Onset", "End")) +
  labs(x= "Time",y="Scaled Intensity",color = "Treatment Group")
## `geom_smooth()` using formula = 'y ~ x'

# 7. Save plots 
ggsave(filename = paste0("../Outputs/Plots/LinePlots_",subpathway,".pdf"))
## Saving 12 x 12 in image
## `geom_smooth()` using formula = 'y ~ x'

Stratified line plots

We may also want to stratify the line plots by a specific condition. In the example below we stratify the line plots by gender in 10 steps.

  1. Read in the analysis data

  2. Read in the chemical annotation data

  3. Define the subpathway of interest.

  4. Prep data for line plots and filter on specific conditions in the data. In this example we are filtering the data on males. Additionally, the same processing steps from above need to be completed.

  5. Filter analysis data on second condition following the same instructions from step 3. In this example we are filtering the data on females.

  6. Run the subpathway_lineplots function for males

  7. Run the subpathway_lineplots function for females

  8. Use the ggarrange function to combine these two plots into one.

  9. Save the plot in Outputs/Plots folder with the LinePlots_{{subpathway}}_stratified.

# 1. Read in analysis data
analysis_data <- read.csv("../Data/Processed/analysis_data.csv", check.names = F)

# 2. Reach in chemical annotation data
met_excel <- "../Data/Metabolon/UNAZ-0501-22VW_ DATA TABLES.xlsx"
chem_data <- read.xlsx(met_excel,sheet = "Chemical Annotation")




# 3. Define subpathway of interest. 
subpathway = "Gamma-glutamyl Amino Acid"


# 4. Prep data for line plots
analysis_data_male <- analysis_data %>%
  filter(Gender=="Male") %>%
  mutate(TIME1 = case_when(TIME1== "PreSymp" ~ 1,
                           TIME1 =="Onset" ~ 2,
                       TIME1 == "End" ~ 3))


# 5. Prep data for line plots
analysis_data_female <- analysis_data %>%
  filter(Gender=="Female") %>%
  mutate(TIME1 = case_when(TIME1== "PreSymp" ~ 1,
                           TIME1 =="Onset" ~ 2,
                       TIME1 == "End" ~ 3))




# 6. Create plot for males
males <- subpathway_lineplots(analysis_data = analysis_data_male,
                     chem_data = chem_data,
                     subpathway = subpathway,
                     X = TIME1,
                     groupBy = GROUP_NAME)+
    scale_x_continuous(breaks = c(1, 2, 3), labels = c("PreSymp", "Onset", "End")) +
    labs(x= "Time",y="Scaled Intensity",color = "Treatment Group")



# 7. Create plot for females
females <- subpathway_lineplots(analysis_data = analysis_data_male,
                     chem_data = chem_data,
                     subpathway = subpathway,
                     X = TIME1,
                     groupBy = GROUP_NAME)+
    scale_x_continuous(breaks = c(1, 2, 3), labels = c("PreSymp", "Onset", "End")) +
    labs(x= "Time",y="Scaled Intensity",color = "Treatment Group")


# 8. Display stratified plot
ggarrange(males,
  females,
  labels = c("Males", "Females"),
  nrow = 2,
  common.legend = T,
  vjust = 0
) 
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

# 9. Save plots 
ggsave(filename = paste0("../Outputs/Plots/LinePlots_",subpathway,"_stratified.pdf"))
## Saving 12 x 12 in image